convex combination between the full-precision $f_n$ and the quantized $\hat{f}_n$ as follows:
\[
\tilde{f}_n = \lambda f_n + (1 - \lambda)\,\hat{f}_n. \tag{5.11}
\]
The hyperparameter $\lambda$ controls the strength of teacher forcing. $\lambda = 1$ gives full correction of the reconstruction error but introduces forward inconsistency, i.e., the connection between the current module and the previous quantized modules is broken. Conversely, $\lambda = 0$ removes the forward inconsistency but suffers from the propagated reconstruction error. To achieve a good trade-off between reconstruction error reduction and forward inconsistency elimination, a linear decay strategy for $\lambda$ is proposed:
\[
\lambda_t = \max\!\left(1 - \frac{t}{T_0},\ 0\right), \tag{5.12}
\]
where $T_0$ is the preset maximum number of decay steps. In the beginning, a large $\lambda$ is desired since each module is barely optimized. Later, a small $\lambda$ is preferred to transition to normal training so that the forward inconsistency can be bridged. The remaining $T - T_0$ steps stick to normal training so that each quantized module adapts to its own predecessors.
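To make the annealed teacher forcing concrete, the PyTorch-style sketch below illustrates one possible implementation of Eqs. (5.11) and (5.12): a linear_decay helper computes $\lambda_t$, and each quantized module reads a convex combination of the full-precision and quantized outputs of its predecessor before the module-wise reconstruction error is accumulated. The function names, the MSE reconstruction loss, and the frozen full-precision branch are illustrative assumptions, not the exact implementation used in the experiments.

```python
import torch
import torch.nn.functional as F


def linear_decay(step: int, T0: int) -> float:
    """Eq. (5.12): lambda_t = max(1 - t / T0, 0)."""
    return max(1.0 - step / T0, 0.0)


def teacher_forced_mrem_loss(fp_modules, quant_modules, hidden, step, T0):
    """Illustrative sketch of one forward pass with annealed teacher forcing.

    fp_modules    -- frozen full-precision Transformer modules f_1, ..., f_N
    quant_modules -- the corresponding quantized modules being tuned
    hidden        -- input hidden states fed to the first module pair
    """
    lam = linear_decay(step, T0)
    total_loss = hidden.new_zeros(())        # scalar loss accumulator
    f_prev, f_hat_prev = hidden, hidden      # both branches start from the same input
    for fp_mod, q_mod in zip(fp_modules, quant_modules):
        with torch.no_grad():                # full-precision branch only provides targets
            f_cur = fp_mod(f_prev)
        # Eq. (5.11): the quantized module reads a convex combination of the
        # full-precision and quantized outputs of its predecessor.
        f_tilde = lam * f_prev + (1.0 - lam) * f_hat_prev
        f_hat_cur = q_mod(f_tilde)
        total_loss = total_loss + F.mse_loss(f_hat_cur, f_cur)   # reconstruction error
        f_prev, f_hat_prev = f_cur, f_hat_cur
    return total_loss, lam
```

Called once per training step $t$, the mixing weight reaches zero at $t = T_0$, after which the remaining $T - T_0$ steps proceed as normal training on the quantized chain, matching the schedule described above.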
The comparison between the proposed method and other existing state-of-the-art BERT quantization methods is presented in Table 5.4. As shown there, both the proposed MREM-S and MREM-P outperform existing PTQ approaches in most cases, and even achieve results close to those of QAT approaches. For example, the “W4-E4-A8” quantized MREM-S and MREM-P reach 83.5% and 83.4% accuracy on MNLI-m, respectively, which is on par with the “W2/4-E8-A8” quantized Q-BERT. For the “W2-E2-A8” quantized models, MREM-S and MREM-P surpass GOBO by 11.7% and 11.3% on MNLI-m, respectively.
In summary, this paper’s contributions are as follows: (1) module-wise reconstruction error minimization (MREM), a fast, memory-saving, and data-efficient approach to improving post-training quantization for language models; (2) a new model-parallel strategy based on MREM that accelerates post-training quantization with a theoretical speed-up for distributed training; and (3) annealed teacher forcing to alleviate the propagation of reconstruction error and boost performance.
TABLE 5.4
Results on the GLUE development set. “MREM-S” denotes sequential optimization.
Quantization   #Bits (W-E-A)   Size   PTQ   MNLI-m   QQP    QNLI   SST-2   CoLA   STS-B   MRPC   RTE    Avg.
-              full-prec.      418    -     84.9     91.4   92.1   93.2    59.7   90.1    86.3   72.2   83.9
Q-BERT         2-8-8           43     -     76.6     -      -      84.6    -      -       -      -      -
Q-BERT         2/4-8-8         53     -     83.5     -      -      92.6    -      -       -      -      -
Quant-Noise    PQ              38     -     83.6     -      -      -       -      -       -      -      -
TernaryBERT    2-2-8           28     -     83.3     90.1   91.1   92.8    55.7   87.9    87.5   72.9   82.7
GOBO           3-4-32          43     ✓     83.7     -      -      -       -      88.3    -      -      -
GOBO           2-2-32          28     ✓     71.0     -      -      -       -      82.7    -      -      -
MREM-S         4-4-8           50     ✓     83.5     90.2   91.2   91.4    55.1   89.1    84.8   71.8   82.4
MREM-S         2-2-8           28     ✓     82.7     89.6   90.3   91.2    52.3   88.7    86.0   71.1   81.5
MREM-P         4-4-8           50     ✓     83.4     90.2   91.0   91.5    54.7   89.1    86.3   71.1   82.2
MREM-P         2-2-8           28     ✓     82.3     89.4   90.3   91.3    52.9   88.3    85.8   72.9   81.6
Note: “MREM-P” denotes parallel optimization. “Size” refers to model storage in “MB”. “PTQ” indicates whether the method belongs to post-training quantization. “Avg.” denotes the average results of all tasks.